Discovering Frequent Substructures from Hierarchical Semi-structured Data
Abstract
Frequent substructure discovery from a collection of semi-structured objects can support the storage, browsing, querying, indexing, and classification of semi-structured documents. This paper examines the problem of discovering frequent substructures from a collection of hierarchical semi-structured objects of the same type. Because such data is irregular and lacks a fixed structure, the use of wildcards is an important aspect of substructure discovery. This paper proposes a more general and powerful wildcard mechanism, which allows us to find more complex and interesting substructures than existing techniques. Furthermore, the complexity of the structural information in semi-structured data and the use of wildcards make existing frequent set mining algorithms inapplicable to substructure discovery. In this work, we adopt a vertical format for the storage of semi-structured objects and adapt a frequent set mining algorithm for our purpose. The application of our approach to real-life data shows that it is very effective.
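The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea of vertical-format frequent set mining with wildcards, not the authors' method. It assumes each object is a nested (label, children) tuple, flattens it into root-to-node label paths, lets a single interior label be replaced by "*", and counts support by intersecting sets of object ids (an Eclat-style technique swapped in for illustration). All function names and the sample data are hypothetical, and sibling relationships between paths are ignored for brevity.

```python
from itertools import combinations

def label_paths(tree, prefix=()):
    """Yield every root-to-node label path of a nested (label, children) tree."""
    label, children = tree
    path = prefix + (label,)
    yield path
    for child in children:
        yield from label_paths(child, path)

def with_wildcards(path):
    """Yield the path itself plus variants with one interior label set to '*'."""
    yield path
    for i in range(1, len(path) - 1):
        yield path[:i] + ("*",) + path[i + 1:]

def vertical_index(objects):
    """Vertical format: map each path to the set of object ids that contain it."""
    index = {}
    for oid, tree in enumerate(objects):
        for path in label_paths(tree):
            for variant in with_wildcards(path):
                index.setdefault(variant, set()).add(oid)
    return index

def frequent_path_sets(objects, min_support):
    """Return frequent single paths and path pairs with their support counts."""
    index = vertical_index(objects)
    singles = {p: ids for p, ids in index.items() if len(ids) >= min_support}
    result = {(p,): len(ids) for p, ids in singles.items()}
    # Support of a pair is the size of the intersection of the two id sets.
    for p, q in combinations(sorted(singles), 2):
        shared = singles[p] & singles[q]
        if len(shared) >= min_support:
            result[(p, q)] = len(shared)
    return result

# Hypothetical sample data: three "movie" objects with irregular structure.
sample_objects = [
    ("movie", [("title", []), ("cast", [("actor", [])])]),
    ("movie", [("title", []), ("crew", [("actor", [])])]),
    ("movie", [("title", []), ("cast", [("director", [])])]),
]

if __name__ == "__main__":
    for paths, support in sorted(frequent_path_sets(sample_objects, 2).items()):
        print(support, paths)
```

In this toy data the concrete path movie/cast/actor occurs in only one object, while the wildcard path movie/*/actor occurs in two, which illustrates why wildcards help when the same information appears under different parent labels.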
Related resources
Efficient Algorithms for Discovering Frequent and Maximal Substructures from Large Semistructured Data
In this paper, we review recent advances in efficient algorithms for semi-structured data mining, that is, discovery of rules and patterns from structured data such as sets, sequences, trees, and graphs. After introducing basic definitions and problems, we present efficient algorithms for frequent and maximal pattern mining for classes of sets, sequences, and trees. In particular, we explain ge...
Discovering Frequent Substructures in Large Unordered Trees
In this paper, we study a frequent substructure discovery problem in semi-structured data. We present an efficient algorithm Unot that computes all frequent labeled unordered trees appearing in a large collection of data trees with frequency above a user-specified threshold. The keys of the algorithm are efficient enumeration of all unordered trees in canonical form and incremental computation ...
Efficient Tree Mining Using Reverse Search
In this paper, we review our data mining algorithms for discovering frequent substructures in a large collection of semi-structured data, where both the patterns and the data are modeled by labeled trees. These algorithms, namely FREQT for mining frequent ordered trees and UNOT for mining frequent unordered trees, efficiently enumerate all frequent tree patterns without duplicates using reve...
Efficient Substructure Discovery from Large Semi-structured Data
With the rapid progress of network and storage technologies, a huge amount of electronic data such as Web pages and XML data [23] has become available on intranets and the Internet. These electronic data are heterogeneous collections of ill-structured data that have no rigid structure and are often called semi-structured data [1]. Hence, there have been increasing demands for automatic methods for extracting usef...
Discovering Associations in XML Data
Knowledge inference from semi-structured data can utilize frequent substructures, in addition to the frequency of data items. In fact, the working assumption of the present study is that frequent sub-trees of XML data represent sets of tags (objects) that are meaningfully associated. A method for extracting frequent sub-trees from XML data is presented. It uses thresholds on frequencies of paths a...